
Cache robots.txt rules per domain #385

Merged

simonsmallchua merged 3 commits into main from work/robots-cache on May 12, 2026

Conversation

simonsmallchua (Contributor) commented May 11, 2026

Summary

JobManager.GetRobotsRules previously refetched /robots.txt on every call.
The stream worker's job-info cache expires at five-minute intervals, so a
long crawl re-hit the origin every five minutes — and a 429 on
/robots.txt came back on the next read with no negative caching. On
2026-05-11 the merrypeople.com run produced two 429s on /robots.txt
five minutes apart. No functional damage (the code falls through to "no
restrictions") but unnecessary origin load and noisy logs.

This change adds a per-domain in-memory cache with:

  • Positive TTL: 1 hour. A long crawl now fetches /robots.txt once
    per hour at most.
  • Negative TTL: 60 seconds. A transient 429 on /robots.txt
    unblocks within a minute and stops re-hammering the origin in the
    meantime.
  • singleflight collapse: concurrent misses for the same domain land
    on one origin fetch instead of N parallel requests.
  • Domain key: lower(util.NormaliseDomain(domain)) so case-variant
    and scheme-prefixed inputs share one entry.

Restart loses the cache, which is fine — a single fetch on the next
read warms it. No Redis involvement (single worker box; in-memory is
the right granularity).
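
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the cached read path. The identifiers (robotsCache, robotsGroup, robotsCacheEntry, robotsTTLPos/robotsTTLNeg, DiscoverSitemapsAndRobots) follow the names used in the review comments below, and util.NormaliseDomain is the project helper named above; the signatures and the RobotsRules result type are assumptions, not the merged code:

```go
package jobs

import (
	"context"
	"strings"
	"time"

	"golang.org/x/sync/singleflight"
)

const (
	robotsTTLPos = time.Hour        // positive TTL: successful fetches
	robotsTTLNeg = 60 * time.Second // negative TTL: errors such as a 429
)

type robotsCacheEntry struct {
	rules     *RobotsRules // result type assumed
	err       error
	expiresAt time.Time
}

// Assumed JobManager fields: robotsCache map[string]robotsCacheEntry,
// robotsMutex sync.RWMutex, robotsGroup singleflight.Group, and a crawler
// with a DiscoverSitemapsAndRobots method.
func (jm *JobManager) GetRobotsRules(ctx context.Context, domain string) (*RobotsRules, error) {
	// One cache entry per canonical domain, regardless of case or scheme.
	key := strings.ToLower(util.NormaliseDomain(domain))

	jm.robotsMutex.RLock()
	e, ok := jm.robotsCache[key]
	jm.robotsMutex.RUnlock()
	if ok && time.Now().Before(e.expiresAt) {
		return e.rules, e.err // cached errors are served too (negative caching)
	}

	// Concurrent misses for the same key collapse onto one origin fetch.
	v, err, _ := jm.robotsGroup.Do(key, func() (any, error) {
		rules, ferr := jm.crawler.DiscoverSitemapsAndRobots(ctx, key)
		ttl := robotsTTLPos
		if ferr != nil {
			ttl = robotsTTLNeg
		}
		jm.robotsMutex.Lock()
		jm.robotsCache[key] = robotsCacheEntry{rules: rules, err: ferr, expiresAt: time.Now().Add(ttl)}
		jm.robotsMutex.Unlock()
		return rules, ferr
	})
	rules, _ := v.(*RobotsRules)
	return rules, err
}
```

(The review below adds one refinement to this picture: caller-side cancellations should not be written as negative cache entries.)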

Test plan

  • go test -race -short ./... passes
  • scripts/security-check.sh clean
  • Tests cover:
    • hit-then-cache positive TTL (10 reads → 1 origin fetch)
    • error caching for negative TTL (5 reads of a 429 → 1 origin fetch)
    • positive TTL expiry triggers refetch
    • negative TTL expiry triggers refetch (and can recover to success)
    • case/scheme normalisation collapses onto one entry
    • 20-way concurrent miss → 1 origin fetch via singleflight
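
As a concrete illustration of the first bullet, a hit-then-cache test could look roughly like the sketch below. stubRobotsCrawler and newJobManagerWithCrawler are named in the CodeRabbit walkthrough further down, but their shapes here are assumed:

```go
import (
	"context"
	"sync/atomic"
	"testing"
)

// stubRobotsCrawler counts origin fetches so tests can assert on them.
type stubRobotsCrawler struct {
	calls atomic.Int64
}

func (s *stubRobotsCrawler) DiscoverSitemapsAndRobots(ctx context.Context, domain string) (*RobotsRules, error) {
	s.calls.Add(1)
	return &RobotsRules{}, nil
}

func TestGetRobotsRules_PositiveTTLCaches(t *testing.T) {
	stub := &stubRobotsCrawler{}
	jm := newJobManagerWithCrawler(t, stub) // assumed test helper

	// Ten reads of the same (case-variant) domain...
	for i := 0; i < 10; i++ {
		if _, err := jm.GetRobotsRules(context.Background(), "Example.COM"); err != nil {
			t.Fatalf("read %d: %v", i, err)
		}
	}
	// ...should reach the origin exactly once.
	if got := stub.calls.Load(); got != 1 {
		t.Fatalf("origin fetches = %d, want 1", got)
	}
}
```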

Risk

  • In-memory only — restart loses the cache. Acceptable: cold start
    triggers at most one fetch per domain seen during the next crawl
    window.
  • The cache lives on JobManager, which is a long-lived singleton in
    the worker process — no leak risk because the key space is bounded by
    the number of domains the worker has ever touched (≪ memory budget).


Summary by CodeRabbit

  • Changed
    • Robots.txt and sitemap discovery now caches results per canonical domain (1h positive, 60s negative TTLs), collapses concurrent fetches and bounds refetching to reduce redundant origin requests and rate-limit responses.
  • Tests
    • Expanded coverage for positive/negative caching, TTL expiry, concurrent request collapsing, cancellation handling, and domain canonicalization.


coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b1944f46-9def-44b1-99bc-674a83cc0c8f

📥 Commits

Reviewing files that changed from the base of the PR and between d04f5c7 and 8c96105.

📒 Files selected for processing (1)
  • internal/jobs/robots_cache_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/jobs/robots_cache_test.go

📝 Walkthrough

JobManager caches robots.txt discovery per normalized, lowercased domain with configurable positive (1h) and negative (60s) TTLs, collapses concurrent misses via singleflight, and avoids caching caller-side cancellations. Tests and CHANGELOG updated to reflect behavior.

Changes

Robots Cache Implementation

  • Cache State & Dependency (internal/jobs/manager.go): Import
    golang.org/x/sync/singleflight; extend JobManager with a robots cache map,
    an RW mutex, and related fields; reposition callback type declarations.
  • Cache Configuration (internal/jobs/manager.go): Define default TTL
    constants (1h positive, 60s negative) and initialize the cache map and TTL
    fields in NewJobManager.
  • GetRobotsRules with Caching (internal/jobs/manager.go): Rewrite
    GetRobotsRules to normalise/lowercase domain keys, return unexpired cached
    {rules, err} pairs, use a singleflight.Group to de-duplicate concurrent
    fetches, cache outcomes with success vs. error TTLs, and skip caching for
    caller cancellations/deadlines.
  • Test Double & Helper (internal/jobs/robots_cache_test.go): Add
    stubRobotsCrawler (atomic call counter) and a newJobManagerWithCrawler
    helper to inject controllable crawler behavior and observe origin call
    counts.
  • Cache Behavior Tests (internal/jobs/robots_cache_test.go): Add tests for
    positive and negative caching, zero-TTL refetch behavior, domain key
    normalization, concurrent-miss collapse, non-caching of context.Canceled,
    use of the canonical lowercased domain for the origin fetch, and a
    compile-time interface assertion.
  • Changelog Documentation (CHANGELOG.md): Populate ## [Unreleased] with a
    "Changed" entry describing normalized-domain caching, TTLs, singleflight
    deduplication, and bounded refetch/429 handling.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant GetRobotsRules
  participant RobotsCache
  participant SingleflightGroup
  participant Crawler
  Caller->>GetRobotsRules: GetRobotsRules(domain, ctx)
  GetRobotsRules->>GetRobotsRules: normalise & lowercase domain
  GetRobotsRules->>RobotsCache: check cached {rules,err,expiresAt}
  alt cache hit (not expired)
    RobotsCache-->>GetRobotsRules: cached rules/err
  else cache miss or expired
    GetRobotsRules->>SingleflightGroup: Do(domainKey, fetchFn)
    alt first caller
      SingleflightGroup->>Crawler: DiscoverSitemapsAndRobots(ctx, domain)
      Crawler-->>SingleflightGroup: rules, err
      SingleflightGroup->>RobotsCache: store {rules,err,expiresAt} using TTL
    else concurrent callers
      SingleflightGroup-->>GetRobotsRules: wait & receive result from first caller
    end
  end
  GetRobotsRules-->>Caller: rules, err

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A cache for robots, neat and keen,
Domains normalised, tidy and clean,
Singleflight hums to stop the throng,
TTLs keep answers quick and strong,
A little rabbit hops — cache fixed, song on!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the
    required threshold of 80.00%. Resolution: write docstrings for the
    functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level
    summary is enabled.
  • Title check ✅ Passed: The title accurately and concisely summarizes the
    main change: implementing caching of robots.txt rules per domain in the
    JobManager.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were
    found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked
    issues were found for this pull request.



supabase Bot commented May 11, 2026

Updates to Preview Branch (work/robots-cache) ↗︎

Deployments (status updated):
  • Database: Mon, 11 May 2026 23:00:26 UTC
  • Services: Mon, 11 May 2026 23:00:26 UTC
  • APIs: Mon, 11 May 2026 23:00:26 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks (status updated):
  • Configurations: Mon, 11 May 2026 23:00:29 UTC
  • Migrations: Mon, 11 May 2026 23:00:30 UTC
  • Seeding: Mon, 11 May 2026 23:00:32 UTC
  • Edge Functions: Mon, 11 May 2026 23:00:32 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.


github-actions Bot commented May 11, 2026

Release Versions

App patch: v0.34.10 → v0.34.11

Changelog

Changed

  • JobManager.GetRobotsRules now caches results per normalised domain (1h
    positive TTL, 60s negative TTL), and collapses concurrent misses onto a single
    origin fetch via singleflight. A long crawl previously refetched /robots.txt
    every five minutes (stream worker's job-info TTL) and a 429 on /robots.txt
    returned on the next read; both are now bounded.

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 86.11111% with 5 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines:
  • internal/jobs/manager.go: 86.11% patch coverage (4 lines missing, 1 partial) ⚠️


coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In internal/jobs/manager.go:
- Around line 412-433: The code caches robots results under the lowercased,
normalised key but calls jm.crawler.DiscoverSitemapsAndRobots(ctx, domain) with
the raw domain, causing mismatched fetches and bad cache entries; update the
call inside jm.robotsGroup.Do to use the canonical key (pass key instead of
domain) so jm.crawler.DiscoverSitemapsAndRobots, any error formatting
(fmt.Errorf("jobs: fetch robots for %s"...)) and the cached entry all operate on
the same normalised origin string.
- Around line 438-448: The current code unconditionally caches errors from
out.err under robotsTTLNeg which allows transient context cancellations to
poison jm.robotsCache; change the logic in the robots caching block to detect
context.Canceled and context.DeadlineExceeded (use errors.Is(out.err,
context.Canceled) || errors.Is(out.err, context.DeadlineExceeded)) and do not
write a negative cache entry for those transient errors (either skip writing to
jm.robotsCache or set ttl to zero and avoid adding an entry unless ttl>0).
Update the conditional around jm.robotsCache assignment (referencing
robotsTTLPos, robotsTTLNeg, out.err, robotsCacheEntry, jm.robotsMutex,
jm.robotsCache) so only non-transient errors are cached with robotsTTLNeg.
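
Both findings amount to small, local edits inside the singleflight fetch function. A sketch under the identifiers the comments name (key, robotsTTLPos/robotsTTLNeg, robotsCacheEntry, jm.robotsMutex, jm.robotsCache), with the surrounding shapes assumed rather than taken from the merged diff:

```go
// Finding 1: fetch with the canonical key so the cache entry, the error
// text, and the origin request all agree on one normalised domain.
rules, err := jm.crawler.DiscoverSitemapsAndRobots(ctx, key) // was: domain

// Finding 2: a caller-side cancellation says nothing about the origin's
// health, so propagate it without writing a negative cache entry.
if err != nil && (errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)) {
	return rules, err
}

ttl := robotsTTLPos
if err != nil {
	ttl = robotsTTLNeg // only non-transient errors get the 60s negative TTL
}
jm.robotsMutex.Lock()
jm.robotsCache[key] = robotsCacheEntry{rules: rules, err: err, expiresAt: time.Now().Add(ttl)}
jm.robotsMutex.Unlock()
```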

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 826131a5-6589-4283-aeed-cb4203c98137

📥 Commits

Reviewing files that changed from the base of the PR and between 41af55a and 0e79222.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • internal/jobs/manager.go
  • internal/jobs/robots_cache_test.go

@github-actions

🐝 Review App Deployed

Homepage: https://hover-pr-385.fly.dev
Dashboard: https://hover-pr-385.fly.dev/dashboard

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In internal/jobs/robots_cache_test.go:
- Around line 175-192: The test currently drops errors from concurrent calls to
GetRobotsRules (the goroutine uses "_, _ =" so failures are ignored); change the
goroutines to capture the returned error (call jm.GetRobotsRules(ctx,
"swarm.com") and send the error into a buffered channel or append to a protected
slice) and after wg.Wait() drain and assert on those errors (e.g., ensure all
are nil or match the expected error when the stub simulates failure). Use the
existing fanout, wg, gate and stub.calls identifiers to locate the loop and
assertions to replace the discarded errors with explicit checks.
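
A sketch of what the suggested change could look like inside that test, reusing the fanout/gate/stub identifiers the comment names (the surrounding test setup is assumed):

```go
const fanout = 20
gate := make(chan struct{})
errs := make(chan error, fanout) // buffered so goroutines never block on send
var wg sync.WaitGroup

for i := 0; i < fanout; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-gate // hold every reader until the gate opens
		_, err := jm.GetRobotsRules(context.Background(), "swarm.com")
		errs <- err
	}()
}
close(gate) // release all readers at once
wg.Wait()
close(errs)

// Assert on the captured errors instead of discarding them.
for err := range errs {
	if err != nil {
		t.Errorf("concurrent GetRobotsRules: %v", err)
	}
}
if got := stub.calls.Load(); got != 1 {
	t.Fatalf("origin fetches = %d, want 1 (singleflight collapse)", got)
}
```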

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 15acd520-5d6e-41a5-9e3e-e9d3e49f011c

📥 Commits

Reviewing files that changed from the base of the PR and between 0e79222 and d04f5c7.

📒 Files selected for processing (2)
  • internal/jobs/manager.go
  • internal/jobs/robots_cache_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/jobs/manager.go

@github-actions

🐝 Review App Deployed

Homepage: https://hover-pr-385.fly.dev
Dashboard: https://hover-pr-385.fly.dev/dashboard


simonsmallchua merged commit b07ebb6 into main on May 12, 2026
21 checks passed
simonsmallchua deleted the work/robots-cache branch on May 12, 2026 at 00:10
simonsmallchua added a commit that referenced this pull request May 12, 2026
